Batch Policy Iteration Algorithms for Continuous Domains
Authors
Abstract
This paper establishes the link between an adaptation of the policy iteration method for Markov decision processes with continuous state and action spaces and the policy gradient method in the case where the mean value is differentiated directly with respect to the policy, without parameterization. This connection allows sound and practical batch reinforcement learning algorithms for continuous state and action spaces to be derived.
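Since the abstract gives no pseudocode, the following is a minimal sketch of the idea it points to, under assumed toy conditions (1-D linear dynamics, quadratic reward, quadratic features): batch approximate policy iteration in which the improvement step ascends the gradient of the fitted action value with respect to the action itself, so the value is differentiated over the policy directly rather than over policy parameters. All names, the feature map, and the step sizes are illustrative assumptions, not the paper's construction.

import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

def dynamics(s, a):
    # Hypothetical 1-D linear dynamics, chosen only for illustration.
    return 0.8 * s + a

def reward(s, a):
    # Quadratic cost: drive the state to zero with small actions.
    return -(s ** 2 + 0.1 * a ** 2)

def features(s, a):
    # Quadratic features, so the fitted Q is a quadratic form in (s, a).
    return np.stack([s * s, s * a, a * a, s, a, np.ones_like(s)], axis=-1)

def policy(s, w):
    # Improvement without parameterization: ascend dQ/da at each state.
    a = np.zeros_like(s)
    for _ in range(50):
        dq_da = w[1] * s + 2 * w[2] * a + w[4]
        a = np.clip(a + 0.1 * dq_da, -2.0, 2.0)
    return a

# Batch of transitions collected under an exploratory behavior policy.
S = rng.uniform(-2, 2, 2000)
A = rng.uniform(-2, 2, 2000)
R = reward(S, A)
S2 = dynamics(S, A)

w = np.zeros(6)
for _ in range(30):
    # Policy evaluation: one fitted Bellman backup by least squares.
    target = R + gamma * features(S2, policy(S2, w)) @ w
    w, *_ = np.linalg.lstsq(features(S, A), target, rcond=None)

print("greedy action at s = 1.0:", policy(np.array([1.0]), w)[0])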
Similar Papers
Fitted Q-iteration in continuous action-space MDPs
We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorou...
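A condensed sketch of the variant described above, under assumed toy dynamics: fitted Q-iteration where the greedy step is replaced by a search over a small restricted set of candidate policies (linear gains here) for the one maximizing the average action value over the batch. The linear policy class and the quadratic features are assumptions for illustration, not the paper's exact construction.

import numpy as np

rng = np.random.default_rng(1)
gamma = 0.95

# Batch of transitions (s, a, r, s') from some behavior policy.
S = rng.uniform(-1, 1, 1000)
A = rng.uniform(-1, 1, 1000)
R = -np.abs(S + A)                  # toy reward: steer s + a toward 0
S2 = np.clip(S + A, -1, 1)          # toy dynamics

# Restricted candidate policy set: a(s) = k * s over a grid of gains.
gains = np.linspace(-2.0, 2.0, 41)

def phi(s, a):
    return np.stack([np.ones_like(s), s, a, s * a, s * s, a * a], axis=-1)

w = np.zeros(6)
for _ in range(50):
    # Policy search: pick the candidate that maximizes the average
    # fitted action value over the next states in the batch.
    avg_q = [np.mean(phi(S2, k * S2) @ w) for k in gains]
    k_best = gains[int(np.argmax(avg_q))]
    # Fitted Q backup under the selected candidate policy.
    target = R + gamma * (phi(S2, k_best * S2) @ w)
    w, *_ = np.linalg.lstsq(phi(S, A), target, rcond=None)

print("selected gain:", k_best)

With this toy reward the search should settle near k = -1, the gain that drives the state to the origin.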
Approximate Dynamic Programming and Reinforcement Learning
Dynamic programming (DP) and reinforcement learning (RL) can be used to address problems from a variety of fields, including automatic control, artificial intelligence, operations research, and economics. Many problems in these fields are described by continuous variables, whereas DP and RL can find exact solutions only in the discrete case. Therefore, approximation is essential in practical DP a...
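To make the point about approximation concrete, here is a small toy example (mine, not the survey's): value iteration on a continuous state space, where the value function is carried by a fitted linear-in-features model instead of an exact table. The RBF features, the discrete action set, and the one-dimensional dynamics are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
gamma = 0.9
actions = np.array([-0.1, 0.0, 0.1])    # assumed discrete action set

def features(s):
    # Radial basis features over [-1, 1] stand in for an exact table.
    centers = np.linspace(-1, 1, 9)
    return np.exp(-((s[:, None] - centers) ** 2) / 0.1)

S = rng.uniform(-1, 1, 400)             # sampled states for the fit
theta = np.zeros(9)
for _ in range(100):
    # Bellman backup computed per sampled state...
    q = np.stack([
        -(S + a) ** 2 + gamma * features(np.clip(S + a, -1, 1)) @ theta
        for a in actions
    ])
    target = q.max(axis=0)
    # ...then projected back onto the feature space: the approximation.
    theta, *_ = np.linalg.lstsq(features(S), target, rcond=None)

print("approximate V(0):", (features(np.array([0.0])) @ theta)[0])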
متن کاملPERFORMANCE OF DIFFERENT ANT-BASED ALGORITHMS FOR OPTIMIZATION OF MIXED VARIABLE DOMAIN IN CIVIL ENGINEERING DESIGNS
Ant colony optimization algorithms (ACOs) were originally introduced for discrete-variable problems and have been applied to different research domains in several engineering fields. Meanwhile, numerous studies have already been devoted to adapting different ant models to continuous search spaces. Assessments indicate competitive performance of ACOs on discrete and continuous domains. Therefore, as poten...
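As a concrete illustration of how ant models carry over to continuous search spaces, here is a simplified sketch in the spirit of archive-based continuous ACO; the objective, archive size, and sampling parameters are assumptions and do not reproduce any specific algorithm compared in the paper.

import numpy as np

rng = np.random.default_rng(3)

def objective(x):
    # Toy sphere function to minimize; stands in for a design problem.
    return float(np.sum(x ** 2))

dim, n_keep, n_ants, q, xi = 5, 10, 20, 0.5, 0.85

# A ranked solution archive plays the role of the pheromone model.
archive = rng.uniform(-5, 5, (n_keep, dim))
fitness = np.array([objective(x) for x in archive])
order = np.argsort(fitness)
archive, fitness = archive[order], fitness[order]

# Rank-based weights: better archive members guide more ants.
ranks = np.arange(n_keep)
wts = np.exp(-ranks ** 2 / (2 * (q * n_keep) ** 2))
p = wts / wts.sum()

for _ in range(200):
    for _ in range(n_ants):
        k = rng.choice(n_keep, p=p)     # pick a guiding solution
        # Sample around the guide; the spread shrinks as the archive
        # concentrates, mimicking pheromone reinforcement.
        sigma = xi * np.mean(np.abs(archive - archive[k]), axis=0)
        x = archive[k] + sigma * rng.standard_normal(dim)
        f = objective(x)
        if f < fitness[-1]:             # replace the current worst
            archive[-1], fitness[-1] = x, f
            order = np.argsort(fitness)
            archive, fitness = archive[order], fitness[order]

print("best objective found:", float(fitness[0]))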
Batch-Switching Policy Iteration
Policy Iteration (PI) is a widely-used family of algorithms for computing an optimal policy for a given Markov Decision Problem (MDP). Starting with an arbitrary initial policy, PI repeatedly updates to a dominating policy until an optimal policy is found. The update step involves switching the actions corresponding to a set of “improvable” states, which are easily identified. Whereas progress ...
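A minimal tabular sketch of the scheme described above: evaluate the current policy exactly, identify the improvable states from the Q-values, and switch only a batch of them per iteration. The random MDP and the batch size are illustrative assumptions, not the paper's setup.

import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma, batch_size = 8, 3, 0.9, 2

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state dist.
R = rng.uniform(0, 1, (nS, nA))                 # R[s, a] = expected reward
pi = np.zeros(nS, dtype=int)                    # arbitrary initial policy

while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
    idx = np.arange(nS)
    V = np.linalg.solve(np.eye(nS) - gamma * P[idx, pi], R[idx, pi])
    # Improvable states: some action beats the current one in Q-value.
    Q = R + gamma * P @ V
    improvable = np.flatnonzero(Q.max(axis=1) > Q[idx, pi] + 1e-10)
    if improvable.size == 0:
        break                           # no improvable states: optimal
    # Switch only a batch of improvable states per iteration.
    batch = improvable[:batch_size]
    pi[batch] = Q[batch].argmax(axis=1)

print("optimal policy:", pi)

Setting batch_size = nS recovers the usual all-states switching rule, while batch_size = 1 switches one improvable state at a time.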
Kernel Rewards Regression: An Information Efficient Batch Policy Iteration Approach
We present the novel Kernel Rewards Regression (KRR) method for Policy Iteration in Reinforcement Learning on continuous state domains. Our method is able to obtain very useful policies from just a few observed state-action transitions. It considers the Reinforcement Learning problem as a regression task, to which any appropriate technique may be applied. The use of kernel methods, e.g. the Support...
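A condensed sketch of the reward-regression idea: represent Q as a kernel expansion and regress the observed rewards on the Bellman differences Q(s, a) - gamma * Q(s', a'), so that fitting the rewards determines Q. The Gaussian kernel, the ridge term, and the toy chain data are assumptions, not the paper's exact formulation.

import numpy as np

rng = np.random.default_rng(5)
gamma = 0.9

# A single short trajectory on a toy chain: s' = clip(s + a), r = -|s|.
s, states, acts, rewards = 0.0, [], [], []
for _ in range(200):
    a = float(rng.choice([-0.1, 0.1]))
    states.append(s)
    acts.append(a)
    rewards.append(-abs(s))
    s = float(np.clip(s + a, -1.0, 1.0))
states.append(s)
acts.append(float(rng.choice([-0.1, 0.1])))

X = np.stack([np.array(states), np.array(acts)], axis=1)   # (s, a) pairs

def kernel(A, B, bw=0.3):
    # Gaussian kernel over state-action pairs (an assumed choice).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw * bw))

K = kernel(X[:-1], X[:-1])          # k((s_t, a_t), .)
K_next = kernel(X[1:], X[:-1])      # k((s_{t+1}, a_{t+1}), .)
M = K - gamma * K_next              # rewards ~ M @ alpha determines Q
r = np.array(rewards)
alpha = np.linalg.solve(M.T @ M + 1e-3 * np.eye(len(r)), M.T @ r)

def Q(s, a):
    return float((kernel(np.array([[s, a]]), X[:-1]) @ alpha)[0])

print("Q(0, +0.1) =", Q(0.0, 0.1), " Q(0, -0.1) =", Q(0.0, -0.1))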